The Google Books corpus, derived from millions of books in a range of major languages, would 
seem to offer many possibilities for research into cultural, social, and linguistic evolution. In a 
previous work, we found that the 2009 and 2012 versions of the unfiltered English data set as well 
as the 2009 version of the English Fiction data set are all heavily saturated with scientific and 
medical literature, rendering them unsuitable for rigorous analysis [Pechenick, Danforth and Dodds, 
PLoS ONE, 10, e0137041, 2015]. By contrast, the 2012 version of English Fiction appeared to be 
uncompromised, and we use this data set to explore language dynamics for English from 1820 to 
2000. We critique an earlier method for measuring birth and death rates of words, and provide a 
robust, principled approach to examining the volume of word flux across various relative frequency 
usage thresholds. We use the contributions to the Jensen-Shannon divergence of words crossing 
thresholds between consecutive decades to illuminate the major driving factors behind the flux. 
We find that while individual word usage may vary greatly, the overall statistical structure of the 
language appears to remain fairly stable. We also find indications that scholarly works about fiction 
are strongly represented in the 2012 English Fiction corpus, and suggest that a future revision of 
the corpus should attempt to separate critical works from fiction itself. 


I. INTRODUCTION 

The incredible volume and free availability of the Google Books corpus [1, 2] make it an
intriguing candidate for linguistic research. In a previous work [3], we broadly explored the
characteristics and dynamics of the English and English Fiction data sets from both the 2009
and 2012 versions of the corpus. We showed that the 2009 and 2012 English unfiltered data sets
and, surprisingly, the 2009 English Fiction data set all become increasingly influenced by
scientific texts throughout the 1900s, with medical research language being especially
prevalent. We concluded that, without sophisticated processing, only the 2012 English Fiction
data set is suitable for any kind of analysis and deduction as it stands. We also described the
library-like nature of the Google Books corpus, which reflects word usage by authors with each
book, in principle, represented once. Word frequency is thus a deceptive aspect of the corpus,
as it does not reflect how often these words are read, as might be informed by book sales and
library borrowing data, much less spoken by the general public. Nevertheless, the corpus
provides an imprint of a language’s lexicon and remains worthy of study, provided all caveats
are clearly understood.

In this paper, we therefore focus on the 2012 version of the English Fiction data set. Fig. 1
shows the total number of 1-grams for this data set between 1800 and 2000 (1-grams are
contiguous text elements and are more general than words, including, for example,
punctuation). An exponential increase in volume is apparent over time, with notable exceptions
during major conflicts when the total volume decreases. For ease of comparison with related
work, and to avoid high levels of optical character recognition (OCR) errors due to the
presence of the long s (e.g., “said” being read as “faid”) [3], we omit the first two decades
and concern ourselves henceforth with 1-grams between the years 1820 and 2000. In releasing
the original data set, Michel et al. [1] noted that English Fiction contained scholarly
articles about fictional works (but not scholarly works in general), and we also explore this
balance here.

FIG. 1: The logarithms of the total 1-gram counts for the Google Books corpus 2012 English
Fiction data set. An exponential increase in volume is apparent over time with notable
exceptions during wartime when the total volume decreases. (This effect is clearest during the
American Civil War and both World Wars.)

Many researchers have carried out broad studies of the Google Books corpus, examining
properties and dynamics of entire languages. These include analyses of Zipf’s and Heaps’ laws
as applied to the corpus [4], the rates of verb regularization [1], rates of word “birth” and
“death” and durations of cultural memory [5], as well as an observed decrease in the need for
new words in several languages [6]. However, most of these studies were performed before the
release of the second version, and, to our knowledge, none have taken into account the
substantial effects of scientific literature on the data sets.

Here, we are especially interested in revisiting work on word “birth” and “death” rates as
performed in [5]. As we show below in Sec. II, the methods employed in [5] suffer from
boundary effects, and we suggest an alternative approach insensitive to time range choice.

We do not, however, dispute that an asymmetry exists in the changes in word use. In our
earlier work [3], we observed this asymmetry in the contributions to the Jensen-Shannon
divergence (defined below) between decades, with most large contributions being accounted for
by words whose relative frequencies had increased over time. In this paper, we apply a similar
information-theoretic approach to examine this effect for words moving across fixed usage
frequency thresholds.

We structure the paper as follows. In Sec. II, we critique the method from [5] which examines
the birth and death rates of words in an evolving, time-coded corpus. In Sec. III, we recall
and confirm a similar apparent bias toward increased usage rates of words from our previous
paper. We then measure the flux of words across various relative frequency boundaries (in both
directions) in the second English Fiction data set. We describe the use of the largest
contributions to the Jensen-Shannon divergence between successive decades from among the words
crossing each boundary as signals to highlight the specific dynamics of word growth and decay
over time. In Sec. IV, we display examples of these word usage changes and explore the factors
contributing to the observed disparities between growth and decay. We offer concluding remarks
in Sec. V.


II. CRITIQUE OF EARLIER WORK 

In [5], Petersen et al. examined the birth and death rates of words over time for various
data sets in the first version of the Google Books corpus. They defined the birth year and
death year of an individual word as the first and last year, respectively, that the given word
appeared above one twentieth of its median relative frequency. Excluded from consideration
were words appearing in only one year and words appearing for the first time before 1700. The
rates of word birth and death, respectively, were found by normalizing the numbers of births
and deaths by the total number of unique words in a given year.
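The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in [5]: the function name `birth_and_death`, the toy frequencies, and the choice to take the median over nonzero years only are all assumptions made here for concreteness.

```python
from statistics import median


def birth_and_death(freq_by_year, fraction=0.05):
    """Return (birth_year, death_year) for a single word.

    freq_by_year maps year -> relative frequency. The threshold is
    `fraction` (one twentieth by default) of the word's median relative
    frequency; taking the median over nonzero years is an assumption.
    """
    nonzero = [f for f in freq_by_year.values() if f > 0]
    if not nonzero:
        return None  # the word never appears
    threshold = fraction * median(nonzero)
    # Only the first and last years above threshold matter;
    # any crossings in between are ignored by this definition.
    above = sorted(y for y, f in freq_by_year.items() if f > threshold)
    return above[0], above[-1]


# Hypothetical frequencies for one word (toy values, not corpus data).
toy = {1900: 0.0, 1901: 3e-6, 1902: 4e-5, 1903: 6e-5,
       1904: 5e-5, 1905: 1e-6, 1906: 0.0}
print(birth_and_death(toy))
```

Changing `fraction` to 0.1 gives the one-tenth variant mentioned in the following paragraph.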

Results typical of all data sets included decreased birth rates and increased death rates over
time. These results are not implausible, and they were noted to be qualitatively similar when
one tenth of the median frequency is used as a threshold. But the very specific nature of the
analysis raises questions as to the robustness of the method, particularly the multiple
temporal restrictions on the words included in the analysis, the reliance on a particular
proportion of each word’s median frequency, and the ignoring of all but the first and last
crossings over this threshold.

Now, the common-sense interpretation of “word death” is clearly that a word falls out of usage
(relatively) at a fixed point in history. Ignoring all but the first and last crossings of a
threshold tied to both a word’s usage frequency and a specific time range appears to cause
problems in this regard in [5], and we find a boundary effect for death rates induced by the
choice of the time range’s end point. To demonstrate this, we recreate the described analysis
for the second version of English Fiction.

We note that in our analyses, the relative frequencies are coarse-grained at the level of
decades (see Methods below). We excluded words appearing in only one decade (rather than year)
and words appearing before the 1820s (instead of 1700). Again, this more recent initial cutoff
date accounts for the high frequency of OCR errors observed before 1820. These differences
with [5] are not substantive, and allow us to re-examine their work and build out our own in
meaningful ways.

We compare the birth and death rates as observed recently versus historically by performing
the analysis with three different endpoints imposed: the 1950s, the 1970s, and the 1990s. We
present the results of the recreation in Fig. 2 (cf. Fig. 2 in [5]). Using the 1990s cutoff,
the observed birth rates are qualitatively similar to those found for various data sets (from
the 2009 version of the corpus) in [5] and display spikes in the 1890s and 1920s (top panel in
Fig. 2, light gray). We see that birth rates are not affected by moving the “end of history”
back to the 1950s or 1970s.

The observed death rates with the 1990s boundary (bottom panel in Fig. 2, light gray) are also
similar to those found in [5], despite the lack of deaths detected during much of the 19th
century. (Recall, we ignored words originating prior to 1820.)

However, as the terminal boundary is moved back to the 1970s, what was originally a stable
region between the 1910s and 1940s turns into an apparent region of gradually increasing word
extinction (bottom panel in Fig. 2, gray). As the boundary is moved further back to the 1950s,
the increase in death rate is no longer gradual (bottom panel in Fig. 2, dark gray). We thus
see a clear dependence of the observations of the death rate on when the history of the corpus
ends. Moving the “start of history” forward in time similarly affects birth rates.

Thus, while the method in [5] provides a reasonable approach to analyzing asymmetries in the
evolutionary dynamics of a language data set, the results for birth and death rates in [5]
depend on when the experiment is performed. So motivated, we proceed to develop an approach
that is robust with respect to time boundaries.

FIG. 2: Birth and death rates, with definitions based on the method used in [5], for the 2012
version of English Fiction as observed between the 1820s and three different end-of-history
boundaries. The lower panel shows death rates are affected by the choice of when history ends.
Birth rates are similarly affected by moving the start of history.


III. METHODS 

We coarse-grain the relative frequencies in the second English Fiction data set at the level
of decades, from the 1820s (1820 to 1829) through the 1990s (1990 to 1999), by averaging the
relative frequency of each unique word in a given decade over all years in that decade. (We
weight each year equally.) This allows us to conveniently calculate and sort contributions to
the Jensen-Shannon divergence (defined below) of individual 1-grams between any two time
periods.
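A minimal sketch of this coarse-graining step follows. The input layout (a dictionary of per-year relative frequencies), the function name `coarse_grain`, and the toy values are assumptions; a word missing from a year is treated as having zero relative frequency in that year, so every year in a decade carries equal weight.

```python
from collections import defaultdict


def coarse_grain(yearly):
    """Average yearly relative frequencies within each decade.

    yearly maps year -> {word: relative frequency}. Each year present in
    the input is weighted equally; absent words count as zero.
    """
    sums = defaultdict(lambda: defaultdict(float))
    n_years = defaultdict(int)
    for year, freqs in yearly.items():
        decade = year - year % 10  # e.g. 1827 -> 1820
        n_years[decade] += 1
        for word, f in freqs.items():
            sums[decade][word] += f
    return {d: {w: s / n_years[d] for w, s in words.items()}
            for d, words in sums.items()}


# Toy input with two years in the 1820s and one in the 1830s.
yearly = {
    1820: {"the": 0.06, "ship": 0.002},
    1821: {"the": 0.04},
    1830: {"the": 0.05},
}
decades = coarse_grain(yearly)
print(decades[1820])
```

Because relative frequencies are averaged rather than raw counts pooled, a year with many books does not dominate its decade.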


A. Statistical divergence between decades 

As in our previous paper [3], we examined the dynamics of the 2012 version of English Fiction
by calculating contributions to the Jensen-Shannon divergence (JSD) [7] between the
distributions of 1-grams in two given decades. We then used these contributions to resolve
specific and important signals in the dynamics of the language. (This material, which is
presented in greater detail in our previous work, is outlined in sufficient detail below.)

Given a language with 1-gram distributions $P$ in the first decade and $Q$ in the second, the
JSD between $P$ and $Q$ can be expressed as

$$D_{JS}(P\|Q) = H(M) - \frac{1}{2}\left[H(P) + H(Q)\right], \tag{1}$$

where $M = \frac{1}{2}(P + Q)$ is a mixed distribution of the two decades, and
$H(P) = -\sum_i p_i \log_2 p_i$ is the Shannon entropy [5] of the original distribution. The
JSD is symmetric and bounded between 0 and 1 bit. These bounds are only observed when the
distributions are identical and free of overlap, respectively.
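Eq. (1) translates directly into code. The sketch below assumes each distribution is a dictionary mapping 1-grams to relative frequencies that sum to 1; the names `entropy` and `jsd` are chosen here for illustration.

```python
import math


def entropy(dist):
    """Shannon entropy in bits, with the convention 0 * log2(0) = 0."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def jsd(p, q):
    """Jensen-Shannon divergence between two distributions, in bits."""
    words = set(p) | set(q)
    # Mixed distribution M = (P + Q) / 2 over the union of both vocabularies.
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in words}
    return entropy(m) - 0.5 * (entropy(p) + entropy(q))


same = {"a": 0.5, "b": 0.5}
print(jsd(same, same))              # identical distributions: lower bound
print(jsd({"a": 1.0}, {"b": 1.0}))  # no overlap: upper bound
```

The two printed values illustrate the bounds stated above: 0 bits for identical distributions and 1 bit for distributions with no words in common.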

The contribution from the $i$th word to the divergence between two decades, as derived from
Eq. (1), is given by

$$D_{JS,i}(P\|Q) = m_i \cdot \frac{1}{2}\left[r_i \log_2 r_i + (2 - r_i)\log_2(2 - r_i)\right], \tag{2}$$

where $r_i = p_i/m_i$, so that the contribution from an individual word is proportional to the
average frequency of the word and also depends on the ratio between the smaller and average
frequencies. To elucidate the second dependency, we reframe the contribution as

$$D_{JS,i}(P\|Q) = m_i\, C(r_i). \tag{3}$$

Words with larger average frequencies yield larger contribution signals, as do those with
smaller ratios, $r_i$, between the frequencies. A common 1-gram changing subtly can produce a
large signal. So can an uncommon or new word given a sufficient shift from one decade to the
next. $C(r_i)$, the proportion of the average frequency contributed to the signal, is concave
(up) and symmetric about $r_i = 1$, where the frequency remains unchanged, yielding no
contribution. If a word appears or disappears between two decades (e.g., $p_i = r_i = 0$ in
the former case), then the contribution is maximized at precisely the average frequency of the
word in question.
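As a sketch of Eqs. (2) and (3), the per-word contributions can be computed as follows. The function names `c_of_r` and `contributions` are illustrative, base-2 logarithms are assumed so that contributions are in bits, and the toy distributions are not corpus data.

```python
import math


def _xlog2x(x):
    """x * log2(x) with the convention 0 * log2(0) = 0."""
    return x * math.log2(x) if x > 0 else 0.0


def c_of_r(r):
    """C(r) = (1/2) [r log2 r + (2 - r) log2 (2 - r)] for 0 <= r <= 2."""
    return 0.5 * (_xlog2x(r) + _xlog2x(2.0 - r))


def contributions(p, q):
    """Per-word JSD contributions m_i * C(r_i), with r_i = p_i / m_i."""
    out = {}
    for w in set(p) | set(q):
        pi, qi = p.get(w, 0.0), q.get(w, 0.0)
        mi = 0.5 * (pi + qi)
        if mi > 0:
            out[w] = mi * c_of_r(pi / mi)
    return out


p = {"a": 0.5, "b": 0.5}   # "b" disappears in the second decade
q = {"a": 0.5, "c": 0.5}   # "c" appears in the second decade
contrib = contributions(p, q)
print(contrib, sum(contrib.values()))
```

In this toy case the unchanged word contributes nothing, while the appearing and disappearing words each contribute exactly their average frequency, and the contributions sum to the total JSD of 0.5 bits.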


B. Exploring asymmetric dynamics 

We observed in [3] that most large JSD contribution signals are due to words whose relative
frequencies increase over time. In this paper, we confirm and explore this effect.

We texture our observations by examining JSD signals due to words crossing various relative
frequency thresholds in either direction, as well as the total volume of word flux in either
direction across these thresholds. It is both convenient and consistent to record flux over
relative frequency thresholds instead of rank thresholds. To demonstrate this consistency, we
observe in Fig. 3 that rank threshold boundaries correspond to nearly constant relative
frequency thresholds, with the exception of the top 1-gram (always the comma), which decreases
gradually in relative frequency. For thresholds of $10^{-5}$ and below, we omit signals
corresponding to references to specific years, since such references would otherwise overwhelm
the charts for these thresholds.
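Counting flux across a fixed relative frequency threshold can be sketched as below. The function names, the toy frequencies, and in particular the rule used to recognize references to years (any four-digit numeral) are assumptions made here; the source does not specify how year references were identified.

```python
def is_year(token):
    """Heuristic (an assumption): four-digit numerals are year references."""
    return len(token) == 4 and token.isdigit()


def flux(prev, curr, threshold, skip_years=False):
    """Return (up, down): sets of words crossing `threshold` between decades.

    prev and curr map word -> relative frequency in consecutive decades.
    A word moves up if it is at or below the threshold in `prev` and above
    it in `curr`; down is the reverse.
    """
    up, down = set(), set()
    for w in set(prev) | set(curr):
        if skip_years and is_year(w):
            continue
        pi, qi = prev.get(w, 0.0), curr.get(w, 0.0)
        if pi <= threshold < qi:
            up.add(w)
        elif qi <= threshold < pi:
            down.add(w)
    return up, down


# Toy, unnormalized frequencies for two consecutive decades.
prev = {"made": 9e-4, "great": 1.2e-3, "1914": 5e-6}
curr = {"made": 1.1e-3, "great": 9e-4, "1914": 2e-5}

print(flux(prev, curr, 1e-3))
print(flux(prev, curr, 1e-5, skip_years=True))
```

The sizes of the two returned sets give the upward and downward flux volumes for one threshold and one pair of decades.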

FIG. 3: Rank threshold boundaries correspond to nearly constant relative frequency threshold
boundaries over many orders of magnitude, with the exception of the top 1-gram (always a
comma), which decreases in relative frequency. The observed stability demonstrates that Zipf’s
law remains largely unchanged for the English Fiction (2012) data set, even though individual
words may vary greatly in rank over time.


IV. RESULTS AND DISCUSSION 

As seen in Fig. 4, more than half of the JSD between a typical given decade and the next is
due to contributions from words increasing in relative usage frequency. The JSDs between the
1820s, 1840s, and 1970s and their successive decades are the only exceptions. Moreover, when
the time differential is increased to three decades, no exceptions remain. This confirms that
an asymmetry exists between signals for words increasing and decreasing in relative use. We
note relative extrema of the inter-decade JSD in the vicinity of major conflicts. Between the
1860s and successive decades, words on the rise contribute substantially to the JSD. This is
consistent with words not relatively popular during wartime (specifically the American Civil
War) being used more frequently in peacetime. A similar tendency holds for the JSD between the
1910s (World War I) and the 1920s. This is not as apparent in the JSD between the 1910s and
the 1940s, possibly because the 1940s coincide with World War II. The absolute maximum for the
single-decade curve corresponds to the divergence between the 1950s and 1960s. This suggests a
strong effect from social movements. (For the 3-decade split, the absolute peak comes from the
JSD between the 1940s and 1970s.)
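The quantity plotted in Fig. 4 (the share of the JSD due to words whose relative frequency increases) can be sketched as follows. The function name `rising_share` and the toy distributions are assumptions for illustration.

```python
import math


def _xlog2x(x):
    """x * log2(x) with the convention 0 * log2(0) = 0."""
    return x * math.log2(x) if x > 0 else 0.0


def rising_share(p, q):
    """Fraction of the JSD between p (earlier) and q (later) that comes
    from words whose relative frequency is higher in q than in p."""
    rising = total = 0.0
    for w in set(p) | set(q):
        pi, qi = p.get(w, 0.0), q.get(w, 0.0)
        mi = 0.5 * (pi + qi)
        if mi == 0:
            continue
        r = pi / mi
        c = 0.5 * mi * (_xlog2x(r) + _xlog2x(2.0 - r))  # Eq. (2)
        total += c
        if qi > pi:
            rising += c
    return rising / total if total > 0 else float("nan")


p = {"a": 0.5, "b": 0.5}
q = {"a": 0.5, "c": 0.5}
print(rising_share(p, q))
```

In this symmetric toy case, one word appears and one disappears with equal weight, so the share is one half; values above one half correspond to the bias toward rising words described above.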

We next consider flux between decades across relative frequency thresholds of powers of 10
from $10^{-2}$ down to $10^{-6}$.

In Fig. 5, we display the volume of flux of 1-grams in both directions across relative
frequency thresholds of powers of 10 from $10^{-4}$ down to $10^{-7}$. We first describe the
very limited flux across the $10^{-2}$ and $10^{-3}$ boundaries (not shown in Fig. 5), and
then investigate the richer transitions for the lower thresholds of $10^{-4}$, $10^{-5}$,
$10^{-6}$, and $10^{-7}$.

Flux across the $10^{-2}$ boundary between consecutive decades is almost nonexistent during
the observed period. Between the 1820s and 1830s, the semicolon falls below the threshold.
Between the 1840s and 1850s, “I” rises above the boundary. Between the 1910s and 1920s, “was”
rises across. This is the entirety of the flux across $10^{-2}$, which shows the regime of
1-grams above this frequency (roughly the top 10 1-grams) is quite stable. The eleven 1-grams
above threshold in the 1990s in decreasing order of frequency are: the comma, the period,
“the”, quotation marks, “to”, “and”, “of”, “a”, “I”, “in”, and “was”.

FIG. 4: Percent of JSD in English Fiction (version 2) due to words increasing in relative
frequency of use for successive decades (dark gray), and decades three apart (light gray;
e.g., 1990s versus 1960s). The contribution for successive decades is nearly always more than
half; the exceptions are between the 1820s, 1840s, and 1970s, and their successive decades.
For decades three apart, the contribution is always greater than 50%. The JSD between
successive decades also shows peaks in the vicinity of major conflicts.

FIG. 5: Total number of words ($\log_{10}$) crossing relative frequency thresholds of
$10^{-4}$, $10^{-5}$, $10^{-6}$, and $10^{-7}$ in both directions between each decade and the
next decade. For each threshold, the upward and downward flux roughly cancel. For either
direction of flux, there appears to be little qualitative difference between the three
smallest thresholds, for which the downward flux between the 1950s and the 1960s is a minimum,
the downward flux increases over the next two pairs of consecutive decades, and then it dips
again between the 1980s and 1990s. For the highest threshold, the increase between the 1960s
and 1970s and the next pair of decades is more noticeable for the upward flux, as is the
decrease between the last two pairs of decades.

FIG. 6: Words crossing relative frequency threshold of $10^{-3}$ between consecutive decades.
Signals for each pair of decades are sorted and weighted by contribution to the JSD between
those decades. Bars pointing to the right represent words that rose above the threshold
between decades. Bars pointing left represent words that fell. In parentheses in each title is
the total percent of the JSD between the given pair of decades that is accounted for by flux
over the $10^{-3}$ threshold.

The set of 1-grams with relative frequencies above $10^{-3}$ (roughly the top 100 1-grams) is
also fairly stable. The flux of 1-grams across this boundary between consecutive decades is
entirely captured by Fig. 6. Parentheses drop in (relative frequency of) use between the 1840s
and 1850s and cross back over the threshold after the American Civil War (between the 1860s
and 1870s). The same is true for before and after World War II (between the 1930s and 1940s
and between the 1940s and 1950s, respectively). Beyond these, the flux is entirely due to
proper words (not punctuation). For example, “made” fluctuates up and down over this threshold
repeatedly over the course of a century. Between the 1870s and the 1880s, “made”, which sees
slightly increased use, is the only word to cross the threshold. The largest number of
crossings is 12, which occurs between the first two decades. Also, “great” struggled over the
first 5 decades and eventually failed to remain great by this measure. “Mr.” fluctuated across
the threshold between the 1830s and 1910s. More recently (since the 1930s), “They” has been
moving up and down across the threshold.

For each threshold between $10^{-4}$ and $10^{-7}$, the upward and downward flux roughly
cancel, which is consistent with Fig. 3. For both upward and downward flux, there appears to
be little qualitative difference between the three smallest thresholds. For these thresholds,
the downward flux between the 1950s and the 1960s is a minimum, the downward flux increases
over the next two pairs of consecutive decades, then it dips again between the 1980s and
1990s. For the highest threshold, the increase between the 1960s and 1970s and the next pair
of decades is more noticeable for the upward flux, as is the decrease between the last two
pairs of decades.

In the experiment recreated in Fig. 2, the word birth rate initially exceeds the death rate by
three orders of magnitude, and this gap declines gradually over the next two centuries.
However, with respect to words fluctuating across relative frequency thresholds in opposite
directions, we see no strong evidence of such marked asymmetry during any long period of time.
With respect to total contributions to the JSD between consecutive decades, there is typically
some bias toward words with increased relative use, as seen in Fig. 4, but the difference need
never be described in orders of magnitude.

To address the fluctuations during the last couple of decades, we begin by displaying in
Fig. 7 the top 60 flux words between the 1970s and the 1980s sorted by contributions to the
JSD between those decades. Note that this pair of decades corresponds to both a dip (below
50%) in the proportion of rising word contributions to the JSD and to an increase in the
volume of downward flux (as well as upward flux for high thresholds). In Fig. 8, we show all
55 flux words between the 1980s and the 1990s.
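Ranking flux words by JSD contribution, as in the flux charts, can be sketched by combining the threshold-crossing test with Eq. (2). The function names, the sign convention (positive for words rising above the threshold, negative for words falling below it), and the toy frequencies are assumptions made for illustration.

```python
import math


def _xlog2x(x):
    """x * log2(x) with the convention 0 * log2(0) = 0."""
    return x * math.log2(x) if x > 0 else 0.0


def jsd_contribution(pi, qi):
    """Contribution of one word to the JSD, following Eq. (2)."""
    mi = 0.5 * (pi + qi)
    if mi == 0:
        return 0.0
    r = pi / mi
    return 0.5 * mi * (_xlog2x(r) + _xlog2x(2.0 - r))


def top_flux_words(prev, curr, threshold, n=60):
    """Words crossing `threshold`, sorted by size of JSD contribution.

    Returns (word, signal) pairs; the signal is positive for words that
    rose above the threshold and negative for words that fell below it.
    """
    signals = []
    for w in set(prev) | set(curr):
        pi, qi = prev.get(w, 0.0), curr.get(w, 0.0)
        rose = pi <= threshold < qi
        fell = qi <= threshold < pi
        if rose or fell:
            c = jsd_contribution(pi, qi)
            signals.append((w, c if rose else -c))
    signals.sort(key=lambda s: abs(s[1]), reverse=True)
    return signals[:n]


# Toy, unnormalized frequencies (not corpus values).
prev = {"the": 0.05, "England": 2e-4, "phone": 5e-5, "manner": 1.5e-4}
curr = {"the": 0.05, "England": 8e-5, "phone": 3e-4, "manner": 9e-5}
ranked = top_flux_words(prev, curr, 1e-4)
for word, signal in ranked:
    print(f"{word:10s} {signal:+.2e}")
```

Words that do not cross the threshold, such as the unchanged common word in this toy example, are excluded regardless of how large their contribution might be.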

Between each pair of decades, we see reduced relative use of particularly British words,
including “England” between the first two decades and “King”, “George”, and “Sir” between the
latter two. We also see reduced use of more formal-sounding words, such as “character”,
“manner”, and “general” between the first two decades and “suppose”, “indeed”, and “hardly”
between the latter two. Increasing are physical and emotional words. Those between the first
two decades include “stared”, “breath”, “realized”, “shoulder” and “shoulders”, “coffee”,
“guess”, “pain”, and “sorry.” Between the latter two, we see “chest”, “skin”, “whispered”,
“hit”, “throat”, “hurt”, “control”, and “lives.” Also included are “phone” and “parents.”

FIG. 7: Words crossing relative frequency threshold of $10^{-4}$ between the 1970s and 1980s.
Signals for each pair of decades are sorted and weighted by contribution to the JSD between
those decades. Bars pointing to the right represent words that rose above the threshold
between decades. Bars pointing left represent words that fell. (The first signal is the
asterisk.)


In Figs. 9 and 10, we display the top 60 flux words, not counting references to years, across
the $10^{-5}$ threshold between the same decades. Many of the words declining below the
threshold between the 1970s and 1980s are unusual spellings such as “tho”, proper names like
“Balzac”, or words from non-English languages like “une.” Increasing across this threshold
between the first two decades are a plethora of mostly female proper names, with “Jessica” and
“Megan” leading. Also seen are “KGB” and “jeans.” (“KGB” decreases in the 1990s, as does
“Russians.”) Increasing between the 1980s and 1990s are a few proper names; however, most of
the signals here are social and sexual in nature, and in part point to the inclusion of
academic literary criticism. These include “lesbian” and “lesbians”, “AIDS”, and “gender” in
the top positions. Also included are both “homosexuality” and the more general “sexuality.” We
also see “girlfriend”, “boyfriend”, “feminist”, and “sexy.”

FIG. 8: Words crossing relative frequency threshold of $10^{-4}$ between the 1980s and 1990s.
See the caption for Fig. 7 for details.

FIG. 9: Words (not counting references to years) crossing relative frequency threshold of
$10^{-5}$ between the 1970s and 1980s. See the caption for Fig. 7 for details.

FIG. 10: Words (not counting references to years) crossing relative frequency threshold of
$10^{-5}$ between the 1980s and 1990s. See the caption for Fig. 7 for details.

For contrast, we show in Fig. 12 the flux across a threshold of $10^{-6}$ between the 1980s
and 1990s (again, not counting years). In particular, while increases in “HIV” and “bisexual”
make the list (similarly to many signals in Fig. 10), as do “fax”, “laptop”, and “Internet”, a
great swath of the signals are accounted for by one franchise. We note increases in “Picard”,
“TNG”, “Sisko”, and “DS9.” These latter signals should serve as a reminder that the word
distributions in the library-like Google Books corpus [3], even for fiction, do not remotely
resemble the contents of normal conversations (at least not for the general population).
However, we do observe signals arising at this threshold from factors external to the
imaginings of specific authors. It would therefore be premature to dismiss the contributions
at this threshold because of an apparent overabundance of “Star Trek.” In fact, since “The
Next Generation” and “Deep Space 9” aired precisely during these two decades, an abundance of
“Star Trek” novels in the English Fiction data set is actually quite encouraging, because
these novels do exist, are available in English, and are (clearly) fiction.

FIG. 11: Words (not counting references to years) crossing relative frequency threshold of
$10^{-6}$ between the 1970s and 1980s. See the caption for Fig. 7 for details.

FIG. 12: Words (not counting references to years) crossing relative frequency threshold of
$10^{-6}$ between the 1980s and 1990s. See the caption for Fig. 7 for details.

For consistency, we also include the flux (omitting years) across this threshold between the
1970s and 1980s in Fig. 11. While not particularly topical, we do see “AIDS” increase above
this threshold a decade prior to its increase over $10^{-5}$ as seen in Fig. 10.

The texture of the signals changes as we dial down the frequency threshold. We typically find
that thresholds of $10^{-4}$ and above produce signals with little to no noise. This is not
surprising since this relative frequency roughly corresponds to the rank threshold for the
1000 most common words (see Fig. 3) in the data set. Using a threshold of $10^{-5}$ (fewer
than 10,000 words fall above this frequency in any given decade), we see some noise (mostly in
the form of familiar names), but still observe many valuable signals. Only when the threshold
is reduced to $10^{-6}$ does the overall texture of the signals become questionable as a
result of a variety of proper nouns far less familiar than those observed with the previous
threshold. However, at this threshold, we also observe several early signals of real social
importance.

FIG. 13: Words crossing relative frequency threshold of $10^{-4}$ between the 1930s and 1940s.
See the caption for Fig. 7 for details.

FIG. 14: Words (not counting references to years) crossing relative frequency threshold of
$10^{-5}$ between the 1930s and 1940s. See the caption for Fig. 7 for details.


Curiously, between the 1930s and 1940s the volume of flux across each threshold is not
atypical (see Fig. 5). Moreover, the asymmetry between the JSD contributions between those
decades is very low. Yet it is obvious that we should expect signals of historical
significance between these two decades. In Figs. 13 and 14, we see words crossing the
$10^{-4}$ and $10^{-5}$ thresholds, respectively (with references to years omitted in
Fig. 14). For the higher threshold, only 56 words cross. The most noticeable such words that
are more commonly used in the 1940s are “General” and “German.” Also, “killed” appears in this
list. Words used less frequently include “pleasure”, “garden”, and “spirit.” For the lower
threshold, we see the signals from prolific authors as in our previous paper [3], particularly
Upton Sinclair’s character, Lanny Budd. We also see more Nazis (“Nazi” and “Nazis”).

FIG. 15: Words (not counting references to years) crossing relative frequency threshold of
$10^{-5}$ between the 1960s and 1970s. See the caption for Fig. 7 for details.

Last, we include one of the more colorful examples. In Fig. 15, we show signals (not including
years) for words crossing the $10^{-5}$ threshold between the 1960s and 1970s. Profanity
dominates. We see more references to The World According to Garp (“Garp”) and “Star Trek”,
again (“Kirk” this time). We also see more “computer”, “TV”, and “plastic.” Signals also
appear for “blacks” and “homosexual”, for drugs (“drug” and “drugs”), and (plausibly) for the
War on Drugs (“enforcement” and “cop”).

We refer the reader to our paper’s Online Appendices at
http://www.compstorylab.org/share/papers/pechenick2015b/ for figures representing flux across
relative frequency thresholds of $10^{-4}$, $10^{-5}$, and $10^{-6}$ between consecutive
decades over the entire period analyzed (the 1820s to the 1990s).


V. CONCLUDING REMARKS 

We recall from [6] and from our own work [3] (Fig. 7d) that the rate of change of a given
language tends to slow down over time. This applies to the 2012 English Fiction data set and
is not contested by us in the present paper. In the critiqued paper [5], it was suggested that
the birth and death rates of words can be calculated in an intuitive, albeit very specific,
manner. This experiment produces birth rates that begin vastly higher than death rates, with
both rates converging over time to around 1%. However, we have seen that these rates converge
to roughly the same values at the end of the available history, regardless of when that is;
i.e., the experiment depends on when you perform it, and recent results always appear
qualitatively similar.

Beyond this boundary issue, we find another cause for concern. When the increased usage bias
in the JSD contributions and the overall and directed volumes of flux are taken into account,
we do not observe even the initial orders-of-magnitude gap between so-called birth and death
rates. Rather, the JSD bias toward increased relative use of words is within one order of
magnitude, and the flux across thresholds is typically balanced.

In fact, this latter point appears to be a fundamental facet of this data set. As we see in
Fig. 3, the number of words above each threshold is roughly constant. This stability of the
rank-frequency relation compels the observed balancing act (and is consistent with a stable
Zipf law distribution [4]). Previously in [3] (Fig. 5d), we have seen the divergence between a
given year and a target year tends to increase gradually with the time difference. This is not
true when, for example, the target year (e.g., 1940) falls during a major war, in which case
we see a spike in divergence. However, as the target year exits this period (e.g., enters the
1950s), the spike settles back into the original gradual growth pattern. It is plausible based
on these earlier observations and the observations in this paper that the distribution of the
language is self-stabilizing: the overall shape of the distribution does not appear to change
drastically with time or with the total volume of the data set. As old words fall out of
favor, new words inevitably appear to fill in the gaps.

Furthermore, despite the fact that the divergence between consecutive years has been observed
to decay over time, we find no shortage of novel word introductions during the most recent
decades (which have the lowest decade-to-decade JSDs). This apparent dissonance clearly
invites further investigation.

Finally, while extremely specific fiction can be of great interest, whether it be in the form
of war novels or volumes from the “Star Trek” franchise, vocabulary from these works is more
easily studied when placed in proper context. Dialing down the relative frequency threshold
across several orders of magnitude helps to capture this distinction. However, further
experimentation is called for, since an automatic means of separating specific signals from
the more general signals (e.g., “Star Trek” from social movements) could allow a more
intuitive grasp of the linguistic dynamics and might, ideally, allow investigators to
hypothesize causal relationships between exogenous and endogenous drivers of the language.